GROUP 31:

Email ID:

Team and Project Meta Information

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a data science competition platform that hosts many datasets. In the past, submitting a result was cumbersome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; a submission takes less than 15 minutes.

  1. Install the library

For more detailed information on setting the Kaggle API see here and here.

Dataset and how to download

Background on the Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who either cannot obtain loans or become victims of untrustworthy lenders.

The Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

image.png

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the Data Webpage and unzip the zip file into DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.
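With the Kaggle API, the download can be scripted roughly as follows. This is a sketch: it assumes your API token (kaggle.json) is already placed under ~/.kaggle/ and that the target directory matches the DATA_DIR defined above.

```shell
# Install the Kaggle CLI (one-time setup)
pip install kaggle

# Download the competition files into the data directory
kaggle competitions download -c home-credit-default-risk -p ../../../Data/home-credit-default-risk

# Unzip in place; -o overwrites any existing files
unzip -o ../../../Data/home-credit-default-risk/home-credit-default-risk.zip \
      -d ../../../Data/home-credit-default-risk
```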

Imports

Data files overview

Data Dictionary

As part of the data download comes a Data Dictionary. It is named HomeCredit_columns_description.csv.

image.png

Application train

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis

Dataset Size

Plotting Missing values Function
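A helper for summarizing missing values per column could be sketched as below (the function name and the demo frame are illustrative, not the notebook's actual implementation; a seaborn heatmap can be layered on top of the same summary):

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, sorted descending."""
    total = df.isnull().sum()
    pct = 100 * total / len(df)
    out = pd.DataFrame({"missing_count": total, "missing_pct": pct})
    # Keep only columns that actually have missing values
    return out[out["missing_count"] > 0].sort_values("missing_pct", ascending=False)

# Toy frame standing in for application_train
demo = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4], "c": [None, None, None, 4]})
print(missing_summary(demo))
```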

Summary of Application train

Missing data for application train

WhatsApp Image 2022-04-12 at 11.30.07 PM.jpeg

Summary of Application Test

Missing data for application Test

WhatsApp Image 2022-04-12 at 11.30.21 PM.jpeg

Summary of Bureau Balance

Missing data for Bureau Balance

Summary of Bureau

Missing data for Bureau

Summary of Credit Card Balance

Missing data for Credit Card Balance

Summary of Installments Payments

Missing data for Installments Payments

Summary of POS CASH Balance

Missing data for POS CASH Balance

Summary of Previous Applications

Missing Data for Previous Applications

Distribution of the target column

Applicants Age

Applicants occupations

Visual EDA

Visualizing the categorical columns to understand the data more efficiently

The number of female clients who borrowed and have not repaid is comparatively higher than that of male clients.

Marital status of client

The bulk of the clients are married, and married clients account for the smallest unpaid-loan amounts, while the number of clients with an unknown marital status is insignificant.

Percentage of clients who owns a car

Fewer than half of the clients own a car; the majority do not, and most of the defaulters fall in this larger group.

Type of educational background the clients have:

Clients with an academic degree are more likely to repay the loan than others.

The Types of house the client stays in:

The graph above shows that the bulk of the unpaid loans come from clients who live in houses/apartments, while the numbers for clients living in office apartments and co-op apartments are minimal.

Types of loan available:

Many people would rather take out a cash loan than a revolving loan.

Correlation with the target column

Dataset questions

Unique record for each SK_ID_CURR

previous applications for the submission file

The applicants in the Kaggle submission file have previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 5 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)
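The banding above can be sketched with pandas' cut (the counts below are made-up stand-ins for the per-ID application counts; the thresholds follow the bands listed above):

```python
import pandas as pd

# Hypothetical number of previous applications per SK_ID_CURR
prev_counts = pd.Series([1, 3, 12, 45, 7, 50, 2], name="n_prev_apps")

# Bin into low / medium / high bands; right=False makes intervals [0,5), [5,40), [40,inf)
bands = pd.cut(prev_counts, bins=[0, 5, 40, float("inf")],
               labels=["low", "medium", "high"], right=False)
print(bands.value_counts())
```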

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, in 3NF or not) we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table will yield many new features about each loan application; these will tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and application_test data) as the primary table and the other files as the secondary tables (e.g., the previous_application dataset). The secondary tables can be joined to the primary table using the key SK_ID_CURR (rows in previous_application are uniquely identified by SK_ID_PREV).

Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features could be:

To build such features, we need to join the application_train data (and application_test data) with the previous_application dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application data sets, thereby generating many new (derived) features, and then joining (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.
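The recommended aggregate-then-merge strategy can be sketched on toy data. The column names below follow the competition schema, but the rows are made up:

```python
import pandas as pd

# Toy primary table (application) and secondary table (previous applications)
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "AMT_INCOME_TOTAL": [100, 200, 150]})
prev = pd.DataFrame({"SK_ID_PREV": [10, 11, 12, 13],
                     "SK_ID_CURR": [1, 1, 2, 2],
                     "AMT_CREDIT": [50, 70, 30, 90]})

# Step 1: aggregate the secondary table to one row per SK_ID_CURR
prev_agg = prev.groupby("SK_ID_CURR")["AMT_CREDIT"].agg(["count", "mean"])
prev_agg.columns = ["PREV_APP_COUNT", "PREV_AMT_CREDIT_MEAN"]

# Step 2: left-join onto the primary table; clients with no history get NaN
app_joined = app.merge(prev_agg, on="SK_ID_CURR", how="left")
print(app_joined)
```

The left join matters: applicants with no previous applications (SK_ID_CURR 3 here) must stay in the primary table, with the derived columns left as NaN for the imputer to handle.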

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main (application) table, labeled and unlabeled:
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'

agg detour

Aggregate using one or more operations over the specified axis:

DataFrame.agg(func, axis=0, *args, **kwargs)

For more details see the pandas agg documentation.
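A quick illustration of agg on a toy frame, including the per-group form used for the secondary tables:

```python
import pandas as pd

df = pd.DataFrame({"client": [1, 1, 2], "amount": [10.0, 30.0, 20.0]})

# One or more aggregations per column
print(df.agg({"amount": ["min", "max", "mean"]}))

# Per-group aggregation: the pattern used when rolling up secondary tables
g = df.groupby("client")["amount"].agg(["sum", "mean"])
print(g)
```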

Multiple condition expressions in Pandas

So far, both our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you need to combine your boolean expressions using logical operators.

Although Python uses the keywords and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.

You must use the following operators with pandas: & (and), | (or), and ~ (not), with each condition wrapped in parentheses.
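A small example of combining conditions with the pandas operators (toy columns, not the real schema):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 33], "income": [30_000, 90_000, 55_000]})

# Use & / | / ~ instead of `and` / `or` / `not`, and parenthesize each condition
mask = (df["age"] > 30) & (df["income"] < 60_000)
print(df[mask])
```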

Missing values in prevApps

feature engineering for prevApp table

feature transformer for prevApp table

Join the labeled dataset

Join the unlabeled dataset (i.e., the submission file)

Processing pipeline

OHE with previously unseen unique values in the test/validation set

Train, validation and Test sets (and the leakage problem we have mentioned previously):

Let's look at a small use case to see how to deal with this:

This last problem can be solved by using the handle_unknown='ignore' option of the OneHotEncoder, which, as the name suggests, ignores previously unseen values when transforming the test set.

Here is an example of that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in the OHE, which ignores values from the
# validation/test set that do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
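A minimal standalone sketch of what handle_unknown="ignore" does when the validation set contains an unseen category (toy values under one of the real column names):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"OCCUPATION_TYPE": ["Laborers", "Drivers", "Laborers"]})
valid = pd.DataFrame({"OCCUPATION_TYPE": ["Drivers", "Managers"]})  # "Managers" unseen

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# The unseen category encodes as an all-zero row instead of raising an error
encoded = ohe.transform(valid).toarray()
print(encoded)
```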

OHE case study: The breast cancer wisconsin dataset (classification)

Please see this blog for more details on OHE when the validation/test set has previously unseen unique values.

HCDR preprocessing

Baseline Model

To get a baseline, we will use some of the features after they have been preprocessed through the pipeline. The baseline model is a logistic regression model.

WhatsApp Image 2022-04-12 at 5.09.30 PM.jpeg

Evaluation metrics

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

The SkLearn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the roc curve, the curve information is summarized in one number.

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
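Writing the file with pandas could look like this (the probabilities below are made up; in practice they come from the fitted model's predict_proba on the test set):

```python
import pandas as pd

# Hypothetical predicted probabilities aligned with the test-set IDs
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.1, 0.9, 0.2],
})
# index=False keeps the required two-column format with a header
submission.to_csv("submission.csv", index=False)
print(open("submission.csv").read())
```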

Kaggle submission via the command line API
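A command-line submission could look roughly like this (the file name and message are illustrative; the competition slug is real):

```shell
# Submit a predictions file with a short description
kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline logistic regression"

# List your submissions and their scores afterwards
kaggle competitions submissions -c home-credit-default-risk
```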

report submission

Click on this link

image.png

Write-up

For this phase of the project, you will need to submit a write-up summarizing the work you did. The write-up form is available on Canvas (Modules-> Module 12.1 - Course Project - Home Credit Default Risk (HCDR)-> FP Phase 2 (HCDR) : write-up form ). It has the following sections:

Abstract

Please provide an abstract summarizing the work you did (150 words)

For any financial institution, an important aspect of issuing a loan is understanding whether the borrower is capable of repayment given a principal, maturity, and schedule. In this project, we aim to develop a machine learning model that uses the various characteristics provided to predict a person's behavior and ability to repay. The dataset comes from the "Home Credit Default Risk" competition. At a high level, it consists of application, demographic, and historical credit behavior data.

For phase 1 of this project, we explored the data through basic EDA on all provided datasets and evaluated a baseline pipeline with our chosen metrics. During EDA, we watched for anomalies and missing or irregular data. We plan to perform detailed statistical analysis of the numerical and categorical features in all provided datasets, with visual exploration, to design the next set of pipelines. We also plan to test several machine learning models, compare and contrast them to find each model's unique qualities, and estimate the best parameters of each model and evaluate them.

The results we obtained using three different algorithms are:

  1. Logistic Regression : 73.4 %
  2. Naive Bayes : 52.11%
  3. Random Forest : 68.02%

Introduction

Home Credit Default Risk is a dataset offered by Home Credit, a firm that provides unbanked people with lines of credit (loans). The data files are:

  • application_train.csv / application_test.csv: the primary dataset, comprising information on loans and loan applicants at the time of application; we use it for both training and testing.
  • bureau.csv: clients' loan histories from other financial institutions, as reported to the Credit Bureau; each row describes one loan of a particular client, and the file covers many clients.
  • bureau_balance.csv: monthly balances of the clients' earlier credits tracked by the Credit Bureau.
  • previous_application.csv: the applicant's prior loans with Home Credit, including the parameters considered when those loans were granted and the client's information at that time.
  • POS_CASH_balance.csv: monthly balance snapshots of the applicant's prior point-of-sale (POS) and cash loans with Home Credit.
  • installments_payments.csv: customers' past payment history for each installment of previous Home Credit loans connected to the loans in our sample.
  • credit_card_balance.csv: monthly balance snapshots of clients' past credit cards with Home Credit.

Feature Engineering and transformers

For this phase, we removed the columns with more than 90% null or missing values and imputed the remaining missing values using SimpleImputer. The rest of the feature engineering will continue in the next phase.

Pipelines

All columns of the application_train dataset were divided into numerical or categorical features, and we created a separate pipeline for each. In the numerical pipeline, null and missing values were replaced by the column mean using SimpleImputer; in the categorical pipeline, they were replaced by the most frequent value, and unknown categories were ignored. Finally, both pipelines were merged into a single pipeline. We used metrics such as log loss, accuracy score, confusion matrix, and ROC AUC score. We split the data into 85% train and 15% validation and then trained Logistic Regression, Naive Bayes, and Random Forest models, using log loss as the loss function, and computed the training accuracy, validation accuracy, confusion matrix, and ROC AUC curve.
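The pipeline described above can be sketched end to end with scikit-learn's ColumnTransformer. The frame below is a toy stand-in for application_train (the column names mimic the real ones; the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({"AMT_INCOME_TOTAL": [100.0, np.nan, 150.0, 120.0],
                  "CODE_GENDER": ["M", "F", None, "F"]})
y = pd.Series([0, 1, 0, 1])

# Numerical branch: mean imputation, then scaling
num_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                     ("scale", StandardScaler())])
# Categorical branch: most-frequent imputation, then OHE ignoring unseen values
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("ohe", OneHotEncoder(handle_unknown="ignore"))])

pre = ColumnTransformer([("num", num_pipe, ["AMT_INCOME_TOTAL"]),
                         ("cat", cat_pipe, ["CODE_GENDER"])])

model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(X, y)
print(model.predict_proba(X)[:, 1])
```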

Experimental results

The results we obtained using three different algorithms are:

  1. Logistic Regression : 73.4 %
  2. Naive Bayes : 52.11%
  3. Random Forest : 68.02%

Discussion

The best model we obtained for Phase 0 was Logistic Regression, with a training accuracy of 91.9% and a Kaggle submission score of 73.4%. Going forward, we plan to improve the feature engineering, perform hyperparameter tuning with K-Fold cross-validation and GridSearchCV, and possibly use advanced gradient boosting models to get as close to the best achievable score as we can.
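The planned hyperparameter tuning could be sketched as follows (synthetic data and an illustrative grid, not our actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the preprocessed feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},   # small illustrative grid
    scoring="roc_auc",                    # the competition metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```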

Conclusion

In this phase, we cleaned the data and selected only the characteristics relevant to the target variable and prediction. We transformed the data with one-hot encoding and applied imputation to fix missing values before feeding the data to the models. We built the baseline pipeline and experimentally compared the accuracies of Logistic Regression, Naive Bayes, and Random Forest. Based on the results, Naive Bayes may be underfitting and Random Forest overfitting, while Logistic Regression performed best, with a training accuracy of 91.9% and a Kaggle submission score of 73.4%. After the improvements outlined in the Discussion, we also plan to apply deep learning techniques, such as artificial neural networks, for better prediction results.

Kaggle Submission

Please provide a screenshot of your best kaggle submission.
The screenshot should show the different details of the submission and not just the score.

Screen Shot 2022-04-12 at 10.46.10 PM.png

Screen Shot 2022-04-12 at 10.49.18 PM.png

Screen Shot 2022-04-12 at 10.38.34 PM.png

References

Some of the material in this notebook has been adapted from here

  1. https://datavizpyr.com/visualizing-missing-data-with-seaborn-heatmap-and-displot/
  2. https://www.kaggle.com/c/home-credit-default-risk/data

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: